{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "slideshow": {
     "slide_type": "notes"
    }
   },
   "source": [
    "# 中文词自动纠错讲解\n",
    "## 本文档主要演示如何通过python实现一个中文词组的自动纠错，如输入“咳数”，输出“咳嗽”\n",
    "## 程序原理：\n",
    "### 给定一待纠错词w,我们需要从一系列候选词中选出一最可能的。也就是：argmax(p(c|w)), c in 候选词表。根据贝叶斯原理，p(c|w) = p(w|c) * p(c) / p(w). 又对任意可能的c,p(w)一样，故也就是求使argmax(p(w|c) * p(c))成立的c.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "# -*- coding:utf-8 -*-\n",
    "__author__ = 'liupenghe'\n",
    "\n",
    "import os\n",
    "import collections\n",
    "import jieba\n",
    "from sxpCi import ci_list"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 初始化所有潜在中文词的先验概率 文本集：50篇医学文章分词后，统计各个中文词的出现频率即为其先验概率"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Building prefix dict from /Library/Python/2.7/site-packages/jieba/dict.txt ...\n",
      "Dumping model to file cache /var/folders/30/cd4n0jcj4_1gnjml7xh8nf500000gn/T/jieba.cache\n",
      "Loading model cost 1.939 seconds.\n",
      "Prefix dict has been built succesfully.\n"
     ]
    }
   ],
   "source": [
    "#对给定的语料库分词\n",
    "def cn_ci(dir_path):\n",
    "    for rdf in ci_list:\n",
    "        jieba.add_word(rdf[0])\n",
    "    all_text = u\"\"\n",
    "    for file_name in os.listdir(dir_path):\n",
    "        if file_name.find(\".txt\") != -1:\n",
    "            file_path = \"/\".join([dir_path, file_name])\n",
    "            with open(file_path, \"r\") as f:\n",
    "                all_text += f.read().decode(\"utf-8\")\n",
    "\n",
    "    terms = jieba.cut(all_text)\n",
    "\n",
    "    return [ci for ci in ','.join(terms).split(',') if ci not in [u'', u\" \"]]\n",
    "\n",
    "\n",
    "#统计语料库中各个单词出现的概率\n",
    "def cn_train(features):\n",
    "    model = collections.defaultdict(lambda: 1)\n",
    "    for f in features:\n",
    "        model[f] += 1\n",
    "    return model\n",
    "CNWORDS = cn_train(cn_ci(\"cn_texts\"))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### CNWORDS 即为我们的单词模型，其为字典格式， key为单词，value为该单词先验概率（词频）。另外，为了弥补可能出现的新词，我们做了平滑处理，对新词默认其出现频率为1."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "第二 27\n",
      "小腿 26\n",
      "编译 16\n"
     ]
    }
   ],
   "source": [
    "# 查看其中的3个验证一下\n",
    "for w, wf in CNWORDS.items()[:3]:\n",
    "    print w, wf"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 当给定一待纠错单词时，我们需要找出可能的正确单词列表。这里我们根据字符距离来找出可能的正确单词列表，我们来回顾一下两个单词之间的字符距离，如果一个单词转变为另一个单词需要编辑n下，如删掉一个字符，替换一个字符，交换两个字符位置，增减一个字符，那么我们就说这两个单词间的字符距离为n。考虑到中文的特殊性，这里我将中文的一个字看成一个字符，当然中文字符会比26个英文字母要多的多。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#载入所有中文字\n",
    "def cn_hanzi():\n",
    "    with open(\"hanzi.txt\", \"r\") as f:\n",
    "        hanzi = f.read().decode(\"utf-8\")\n",
    "        return hanzi\n",
    "\n",
    "cn_hanzi_str = cn_hanzi()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "#根据字符距离构造可能的单词列表，这里只计算与待检查单词字符距离为1的单词\n",
    "def cn_edits1(ci):\n",
    "    splits     = [(ci[:i], ci[i:]) for i in range(len(ci)   + 1)]\n",
    "    # 比待检查单词少一个字的单词\n",
    "    deletes    = [a + b[1:] for a, b in splits if b if a + b[1:] in CNWORDS] \n",
    "    # 交换待检查单词中的任意相邻两个字的位置\n",
    "    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1 if a + b[1] + b[0] + b[2:] in CNWORDS] \n",
    "    # 使用所有中文字替换待检查字中的某个字\n",
    "    replaces   = [a + c + b[1:] for a, b in splits for c in cn_hanzi_str if b if a + c + b[1:] in CNWORDS]\n",
    "    # 向待检查词中插入新字\n",
    "    inserts    = [a + c + b     for a, b in splits for c in cn_hanzi_str if a + c + b in CNWORDS]\n",
    "    return set(deletes + transposes + replaces + inserts)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 由于中文字有5000多个，因而由字符距离1来构造出来的可能候选单词列表将会很大，因而我们对构造出来的单词做了一次验证后再将其加入候选集合中，即我们判断了下该词是否为有效单词，根据其是否在我们的单词模型中。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 如果觉得只考虑单词距离为1的单词数量不够用，我们可以继续加入与待检查单词距离为2的单词"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def cn_known_edits2(ci):\n",
    "    return set(e2 for e1 in cn_edits1(ci) for e2 in cn_edits1(e1) if e2 in CNWORDS)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 到这一步，我们的模型基本构造完成，完成最后的修改函数吧。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def cn_correct(ci):\n",
    "    # 候选词列表\n",
    "    candidates = cn_edits1(ci) or cn_known_edits2(ci)\n",
    "    # 找出其中概率最大的单词\n",
    "    return max(candidates, key=CNWORDS.get)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 测试一下吧"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "咳嗽\n"
     ]
    }
   ],
   "source": [
    "print cn_correct(u'咳数')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "呕吐\n"
     ]
    }
   ],
   "source": [
    "print cn_correct(u'呕土')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "虽然\n"
     ]
    }
   ],
   "source": [
    "print cn_correct(u'传然')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "感染\n"
     ]
    }
   ],
   "source": [
    "print cn_correct(u'感帽')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 从上面的结果我们可以看出来，程序可以将我们打错的单词自动修改成正确的单词。但不难发现，程序仍然存在问题，如我们打“传然”时，我们的本意可能是“传染”，然而程序却改成了“虽然”；我们打“感帽”我们的本意是“感冒”，却被修改为了“感染”。因而考虑从以下几个方向改进：\n",
    "- 1,考虑人们的打字习惯，人们通常越往后字打错的可能越大，因而可以考虑每个字在单词中的位置给予一定权重，这中方法有助于改进上面的第一种“传然”－ \"虽然\"的情况；\n",
    "- 2,考虑拼音的重要性，对汉语来讲，通常人们打错时拼音是拼对的，只是选择时候选择错了，因而对候选词也可以优先选择同拼音的字。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "参考资料：http://norvig.com/spell-correct.html  这时google大牛写的英文单词的自动拼写，本程序主要参考其代码实现。"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}
